These exercises are about manipulating single-cell data with Bioconductor packages. Please download the count matrix from BOX and load it into a SingleCellExperiment object. Alternatively, you may use the RDS file data/scSeq_CKO_sceSub.rds.

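If you want to load the whole dataset:

library(DropletUtils)
library(DropletTestFiles)
# Placeholder: replace with the path to the 10X count matrix
fname <- "~path to the 10X counting matrix"
# Read the counts into a SingleCellExperiment, using the barcodes as column names
sce <- read10xCounts(fname, col.names=TRUE)
# Placeholder: replace with the path where the RDS file should be saved
saveRDS(sce,"path to rds file")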

If you want to load a subset of the data:

library(DropletUtils)
library(DropletTestFiles)
sce <- readRDS("data/scSeq_CKO_sceSub.rds")

Exercise 2 - Data manipulation with Bioconductor packages

Identify empty droplets

  1. Draw a knee plot and identify the inflection point and the knee point.
## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot
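
One possible way to approach the knee plot is sketched below. It uses `barcodeRanks()` from DropletUtils on a count matrix such as `counts(sce)`. The helper name `draw_knee_plot`, the `lower = 100` bound, and the colours are illustrative assumptions, not part of the exercise.

```r
library(DropletUtils)

# Rank barcodes by total UMI count and draw a log-log knee plot.
# `lower = 100` is an assumed bound: barcodes at or below it are ignored
# when the knee and inflection points are estimated.
draw_knee_plot <- function(counts_mat, lower = 100) {
  bcrank <- barcodeRanks(counts_mat, lower = lower)
  # Barcodes with the same total share a rank, so plot each rank only once
  uniq <- !duplicated(bcrank$rank)
  plot(bcrank$rank[uniq], bcrank$total[uniq], log = "xy",
       xlab = "Rank", ylab = "Total UMI count")
  # The knee and inflection points are stored in the metadata of the result
  pts <- S4Vectors::metadata(bcrank)
  abline(h = pts$inflection, col = "darkgreen", lty = 2)
  abline(h = pts$knee, col = "dodgerblue", lty = 2)
  legend("bottomleft", legend = c("Inflection", "Knee"),
         col = c("darkgreen", "dodgerblue"), lty = 2)
  bcrank
}
```

For example, `bcrank <- draw_knee_plot(counts(sce))` returns the ranking table, and `metadata(bcrank)$knee` and `metadata(bcrank)$inflection` give the two points. Barcodes with a total of zero cannot be shown on a log axis, which is the likely source of the warning above.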

  1. Identify non-empty droplets and compare the results with those from a hard cut-off.
  • Set the limit to 100 (filter out droplets with UMI counts below 100).
  • Use an FDR cut-off of 0.001 for non-empty droplets.
## DataFrame with 400000 rows and 5 columns
##                        Total   LogProb    PValue   Limited       FDR
##                    <integer> <numeric> <numeric> <logical> <numeric>
## CCACGGAAGCTCTCGG-1         0        NA        NA        NA        NA
## TGTGGTAAGAGTCGGT-1         2  -11.1237 0.7561244     FALSE   0.99174
## CCCTCCTTCGTCTGCT-1         2  -18.2572 0.0727927     FALSE   0.66022
## ACTGAGTGTGTCGCTG-1         0        NA        NA        NA        NA
## CGTAGCGTCTCTGCTG-1         0        NA        NA        NA        NA
## ...                      ...       ...       ...       ...       ...
## AGAGCTTGTCCGACGT-1         0        NA        NA        NA        NA
## GCTGCGACAATAGCAA-1         0        NA        NA        NA        NA
## CGAACATAGTGAAGTT-1         2  -12.3017  0.632437     FALSE   0.99174
## TAGCCGGCATGTCGAT-1         0        NA        NA        NA        NA
## CGAGAAGAGCAGATCG-1         0        NA        NA        NA        NA
##    Mode   FALSE    TRUE    NA's 
## logical  211696    2110  186194
##        Limited
## Sig      FALSE   TRUE
##   FALSE 209919   1777
##   TRUE       1   2109
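
A minimal sketch of this step with `emptyDrops()` from DropletUtils follows. Two points are assumptions: `test.ambient = TRUE` is used because the output above reports p-values for barcodes with very low totals, and the hard cut-off is taken to mean "total UMI count above the limit". The helper name `call_cells` is hypothetical.

```r
library(DropletUtils)

# Call non-empty droplets with emptyDrops() and compare with a hard cut-off.
# Assumptions: test.ambient = TRUE, and a hard cut-off of Total > limit.
call_cells <- function(counts_mat, limit = 100, fdr = 0.001) {
  e.out <- emptyDrops(counts_mat, lower = limit, test.ambient = TRUE)
  # Barcodes without any counts have NA statistics; treat them as empty
  is.cell <- !is.na(e.out$FDR) & e.out$FDR <= fdr
  hard.cut <- e.out$Total > limit
  list(
    stats = e.out,
    is.cell = is.cell,
    # Agreement between the statistical call and the hard cut-off
    comparison = table(emptyDrops = is.cell, HardCutoff = hard.cut),
    # Significant calls against the Limited flag, as in the table above
    limited = table(Sig = is.cell, Limited = e.out$Limited)
  )
}
```

Because `emptyDrops()` relies on Monte Carlo simulation, setting a seed first keeps the result reproducible, for example `set.seed(100); res <- call_cells(counts(sce))`. The non-empty droplets can then be kept with `sce <- sce[, res$is.cell]`.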

Data normalization and cluster data

  1. Normalize and cluster now that empty droplets have been removed.
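
The following is one way this step could look, assuming `sce` already contains only non-empty droplets. It uses deconvolution size factors from scran, PCA from scater, and graph-based clustering. The helper name, the number of variable genes, `k = 10`, and the walktrap algorithm are illustrative choices rather than the required solution.

```r
library(scran)
library(scater)

# Normalize, select variable genes, run PCA and cluster the cells.
# n_hvgs and k are assumed values chosen for illustration.
normalize_and_cluster <- function(sce, n_hvgs = 2000, k = 10) {
  # Pre-cluster so that size factors are pooled across similar cells
  pre <- quickCluster(sce)
  sce <- computeSumFactors(sce, clusters = pre)
  sce <- logNormCounts(sce)
  # Model the mean-variance trend and keep the most variable genes
  dec <- modelGeneVar(sce)
  hvgs <- getTopHVGs(dec, n = n_hvgs)
  sce <- runPCA(sce, subset_row = hvgs)
  # Build a shared nearest-neighbour graph on the PCs and find communities
  g <- buildSNNGraph(sce, use.dimred = "PCA", k = k)
  sce$label <- factor(igraph::cluster_walktrap(g)$membership)
  sce
}
```

After `sce <- normalize_and_cluster(sce)`, the cluster assignments are in `sce$label` and can be inspected with `table(sce$label)`.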

Evaluate ambient RNA contamination

  1. Please estimate ambient RNA contamination, remove contaminants, and validate by using the Hba-a1 gene.
## ENSMUSG00000051951 ENSMUSG00000025902 ENSMUSG00000033845 ENSMUSG00000025903 
##       1.669575e-07       1.669575e-07       1.407972e-04       5.201547e-05 
## ENSMUSG00000104217 ENSMUSG00000033813 
##       1.669575e-07       6.783772e-05
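
A possible sketch of the ambient RNA step is shown below. It assumes that `e.out` is the DataFrame returned by `emptyDrops()` on the full matrix, that the ambient profile is taken from its metadata (consistent with the per-gene proportions printed above), and that `groups` holds one cluster label per cell. The helper names and the `"decontaminated"` assay name are hypothetical.

```r
library(DropletUtils)

# Remove ambient counts cluster by cluster with removeAmbience().
# Assumptions: `e.out` comes from emptyDrops() and its row names match `sce`.
strip_ambient <- function(sce, e.out, groups) {
  # Ambient proportions estimated from the low-count barcodes
  amb <- S4Vectors::metadata(e.out)$ambient[, 1]
  # Keep only the genes that have an ambient estimate
  stripped <- sce[names(amb), ]
  decont <- removeAmbience(counts(stripped), ambient = amb, groups = groups)
  # Store the cleaned counts next to the original ones
  assay(stripped, "decontaminated", withDimnames = FALSE) <- decont
  stripped
}

# Summed counts of one gene per cluster, before and after removal
gene_by_group <- function(stripped, gene_id, groups) {
  before <- tapply(counts(stripped)[gene_id, ], groups, sum)
  after <- tapply(assay(stripped, "decontaminated")[gene_id, ], groups, sum)
  data.frame(group = names(before),
             before = as.vector(before), after = as.vector(after))
}
```

To validate with Hba-a1, the row name can be looked up with `id <- rownames(sce)[which(rowData(sce)$Symbol == "Hba-a1")]`, followed by `gene_by_group(strip_ambient(sce, e.out, sce$label), id, sce$label)`. Clusters that should not express hemoglobin would be expected to show lower counts after removal.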

  1. Re-cluster the data after ambient RNA removal.

Remove doublets

  1. Please identify doublets and evaluate whether they are enriched in any cluster. Then try to remove the doublet cells/clusters.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0422  0.3840  0.7385  0.9630  1.3082  7.2457

## 
##   no  yes 
## 2004  106
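
One way to sketch the doublet step uses the scDblFinder package: `computeDoubletDensity()` for per-cell scores and `doubletThresholding()` to turn them into calls. The helper name `find_doublets` is hypothetical, and the factor levels `"singlet"` and `"doublet"` are assumed for the returned calls, so the labels may differ from the `no`/`yes` table above.

```r
library(scDblFinder)

# Score doublets, call them, and check enrichment per cluster.
# `clusters` is assumed to hold one cluster label per cell in `sce`;
# `hvgs` (optional) restricts the scoring to selected genes.
find_doublets <- function(sce, clusters, hvgs = NULL) {
  # Density of simulated doublets around each cell
  scores <- computeDoubletDensity(sce, subset.row = hvgs)
  # Outlier-based threshold on the scores
  calls <- doubletThresholding(data.frame(score = scores),
                               method = "griffiths", returnType = "call")
  # Fraction of each call within every cluster
  enrichment <- prop.table(table(Cluster = clusters, Call = calls), margin = 1)
  list(scores = scores, calls = calls, enrichment = enrichment,
       # Assumes the level for non-doublets is named "singlet"
       singlets = sce[, calls == "singlet"])
}
```

A call such as `set.seed(100); res <- find_doublets(sce, sce$label)` gives `summary(res$scores)` and `table(res$calls)` for comparison with the output above. Clusters with a high doublet fraction in `res$enrichment` are candidates for removal as a whole.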

Advanced QC plots

  1. Please estimate the mitochondrial content (is.mito), read counts (sum), and gene counts (detected) for each cell. Then draw the following plots:
  • violin plots of is.mito, sum, and detected
  • scatter plots of is.mito vs sum and detected vs sum
  1. Estimate the variance explained by each factor and find the factor that explains the largest share of the variance.
## Warning: Removed 2785 rows containing non-finite values (stat_density).
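
Both QC tasks can be sketched with scater, assuming `sce` already has a `logcounts` assay. Here `is.mito` is a logical vector marking mitochondrial genes, and the mitochondrial content is stored by `addPerCellQC()` in the column `subsets_Mito_percent`. The helper name and the choice of factors in `variables` are assumptions for illustration.

```r
library(scater)

# Per-cell QC metrics, QC plots, and variance explained by each factor.
# `is.mito` is assumed to be a logical vector with one entry per gene.
qc_and_variance <- function(sce, is.mito,
                            variables = c("sum", "detected",
                                          "subsets_Mito_percent")) {
  sce <- addPerCellQC(sce, subsets = list(Mito = is.mito))
  plots <- list(
    violin_mito = plotColData(sce, y = "subsets_Mito_percent"),
    violin_sum = plotColData(sce, y = "sum"),
    violin_detected = plotColData(sce, y = "detected"),
    mito_vs_sum = plotColData(sce, x = "sum", y = "subsets_Mito_percent"),
    detected_vs_sum = plotColData(sce, x = "sum", y = "detected")
  )
  # Percentage of variance in each gene explained by each factor
  vars <- getVarianceExplained(sce, variables = variables)
  # Genes without variance give non-finite values, so drop them in the mean
  mean_r2 <- colMeans(vars, na.rm = TRUE)
  list(sce = sce, plots = plots, var_explained = vars,
       explanatory_plot = plotExplanatoryVariables(vars),
       top_factor = names(which.max(mean_r2)))
}
```

For mouse data, mitochondrial genes might be flagged with `is.mito <- grepl("^mt-", rowData(sce)$Symbol)`, assuming gene symbols are stored in the `Symbol` column. After `res <- qc_and_variance(sce, is.mito)`, each plot can be printed from `res$plots`, and `res$top_factor` names the factor with the highest average variance explained.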